CS 294 - 1 Homework 1 Mobin Javed Collaborators

Author

  • Mobin Javed
Abstract

First, a naive implementation whose feature selection keeps only a few features yields poor accuracy because (i) stop words and non-words have not been filtered out and end up among the highest-probability features, and (ii) with so few features the dictionary may be biased towards one class. Table 1 shows the positive and negative dictionaries formed by selecting the top ten features from each class for a baseline implementation. These features do not achieve good accuracy: the word ‘life’, for example, indicates neither a positive nor a negative review, yet it appears among the ten features. Second, the positive dictionary also contains several words that indicate negative sentiment, such as ‘stupid’, ‘worst’, and ‘boring’. Feature selection with ten features on the baseline implementation yields a precision of only 0.65 with a smoothing parameter of alpha = 0.01. Next, consider whether bigrams help when selecting a few features without any preprocessing of the tokens (e.g. stopword filtering). Table 2 shows the top ten bigrams for each class. Clearly, ‘as the’ and ‘he is’ should not have appeared as indicative of either the positive or the negative class. Furthermore, negative bigrams such as ‘the worst’ and ‘waste of’ also show up among the positive indicators.
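
The selection step described above can be illustrated with a minimal sketch. It is not the author's implementation: the function name `top_k_features`, the toy reviews, and the choice to smooth over each class's own vocabulary are assumptions made for illustration. It ranks tokens (or bigrams) within one class by their alpha-smoothed probability and keeps the top k, with no stopword filtering.

```python
from collections import Counter


def top_k_features(docs, k=10, alpha=0.01, use_bigrams=False):
    """Return the k highest-probability features for one class.

    docs: list of token lists that all belong to the same class.
    Hypothetical helper: smoothing is applied over this class's own
    vocabulary, which is an assumption, not a detail from the source.
    """
    counts = Counter()
    for tokens in docs:
        if use_bigrams:
            # Join adjacent tokens into space-separated bigrams.
            tokens = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        counts.update(tokens)

    total = sum(counts.values())
    vocab_size = len(counts)
    probs = {
        tok: (c + alpha) / (total + alpha * vocab_size)
        for tok, c in counts.items()
    }
    # Highest probability first; ties broken alphabetically for determinism.
    return sorted(probs, key=lambda t: (-probs[t], t))[:k]


# Toy reviews (made up for illustration), tokenized with no stopword filtering.
pos_docs = [["the", "film", "is", "great"],
            ["the", "story", "is", "the", "best"]]
neg_docs = [["the", "film", "is", "the", "worst"],
            ["the", "plot", "is", "boring"]]

pos_top = top_k_features(pos_docs, k=2)
neg_top = top_k_features(neg_docs, k=2)
pos_bigrams = top_k_features(pos_docs, k=10, use_bigrams=True)
```

On this toy data both classes select the same two stop words, so the selected features carry no sentiment signal, which is the failure mode the abstract attributes to unfiltered tokens.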

Similar resources

CS 5314 Randomized Algorithms

Homework 1 (Suggested Solutions) 1. Ans. Use principle of deferred decision. Let X_i denote the outcome of the i-th die so that Pr(X_i = k) = 1/6, where 1 ≤ k ≤ 6.

CS 294 - 1 Homework 3 Timothy Hunter and Andre

In this assignment, the goal was to parse a large set of Wikipedia articles, to extract features into a sparse feature matrix and to cluster them with a clustering algorithm of our choice. One motivation to perform automated clustering on unstructured, unlabeled data is to detect correlations between data points; for instance, in the case of Wikipedia, one might be able to automatically group a...

Publication date: 2012